Housing Market - Paper2
1 Introduction
This paper continues our discussion of Seattle housing prices from our first project. As mentioned in our earlier report, “Seattle boasts among the ‘hottest’ housing markets in the country; as of July 2018, Seattle ‘led the nation in home price gains’ for 21 straight months.” Given this context, our basic SMART question remains the same: how can we predict house prices in Seattle? This turns out to be a regression problem on the target variable of price, and we apply four different approaches learned throughout the course to try to solve it.
Included in our discussion is some exploratory data analysis (EDA) from our first assignment, along with several new models: KNN, ridge and lasso regression, PCA/PCR, and decision trees/random forests. Each model comes with its own advantages and disadvantages, and no single model offers total explanatory power. Our hope, however, is that the whole is greater than the sum of its parts.
2 EDA
The following are excerpts and graphs from the EDA section of our previous report. We are including them here to remind the reader of our dataset’s attributes, before we dive into the analysis.
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
“A brief overview of the dataset yields the following observations for housing price: the minimum price is $78,000, while the maximum is $7,700,000 (quite a large range); the mean is $540,198, indicating that the dataset is right-skewed, as the histogram below further confirms; the standard deviation is $367,142; and the variance is 134,792,956,735 (quite large, indicating that the data points are very spread out from the mean and from one another).”
Below is a visualization of the points in the dataset by price on a map, plotted with the leaflet library. Note that the data have been divided into unequal bins to better visualize the distribution of housing prices, so please read the legend carefully. More expensive houses tend to be concentrated near the water and the center of the city.
2.1 Important Feature Comparisons
From the scatterplot, it’s apparent that there is a relatively strong positive correlation between housing price and living space (0.70192, to be exact). That is, as living space increases, so does housing price. Note that a majority of the data points lie below 6,000 sqft and below $2 million.
Here we have a boxplot comparing “grade” with housing price. “Grade” represents an index from 1 to 13, with the lowest number representing poor construction and design. The trend is clear: construction and design grade correlate positively with housing price.
3 KNN
Here, we pass “price” through a log function to better normalize its distribution (unnormalized, it has a heavy right skew). After the log transformation, we convert “price” to a factor variable divided into 3 categories (“Low”, “Medium”, and “High”) to prepare for KNN analysis.
## [1] 11.3 15.9
## Low Medium High
## 7215 13996 385
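The transformation just described can be sketched in a few lines of base R; the one-column frame below is a hypothetical stand-in for the 21,613-row dataset:

```r
# Hypothetical stand-in for the full kc_house_data frame used in the report
house <- data.frame(price = c(78000, 221900, 538000, 604000, 7700000))

log_price <- log(house$price)            # tame the heavy right skew

# Divide the log scale into three equal-width bands and label them
house$price_cat <- cut(log_price, breaks = 3,
                       labels = c("Low", "Medium", "High"))
table(house$price_cat)
```

On the real data this yields the category counts shown above.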
Next, we split the data into 80% training, and 20% test subsets.
## [1] 0.8
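A minimal sketch of the split, assuming the data frame is named `house` (the stand-in below is hypothetical):

```r
# Hypothetical 100-row stand-in for the dataset
house <- data.frame(price = rnorm(100))

set.seed(1)  # for reproducibility
train_idx <- sample(nrow(house), size = 0.8 * nrow(house))
train <- house[train_idx, , drop = FALSE]   # 80% training
test  <- house[-train_idx, , drop = FALSE]  # 20% test
nrow(train) / nrow(house)                   # the 0.8 printed above
```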
We then apply the chooseK() function to determine the best k value for our dataset. Within the call to chooseK(), we select only features that are truly numeric (KNN requires numeric predictors). For instance, even though “yr_built” is stored as an “integer” data type, a year is best thought of as categorical, not numerical, so we excluded it and other similar variables from the analysis.
From the resulting graph, it becomes evident that 12 is approximately the best value for k: it offers the highest accuracy.
Now that we have our k value, we can run our KNN analysis, using the same features we fed the chooseK() function previously.
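The classification step can be sketched with class::knn(); the feature matrix below is a hypothetical stand-in for the scaled numeric columns used in the report:

```r
library(class)  # provides knn(); ships with standard R installations

# Hypothetical stand-in features; the report uses scaled versions of
# bedrooms, bathrooms, sqft_living, sqft_lot, floors, sqft_above,
# sqft_basement, sqft_living15 and sqft_lot15
set.seed(1)
X <- scale(matrix(rnorm(60), ncol = 2))
y <- factor(rep(c("Low", "High"), each = 15))
train_idx <- 1:20

price_predict <- knn(train = X[train_idx, ],
                     test  = X[-train_idx, ],
                     cl    = y[train_idx],
                     k     = 12)   # k chosen from the chooseK() elbow plot
table(price_predict)
```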
## Factor w/ 3 levels "Low","Medium",..: 1 2 2 2 2 2 2 2 2 2 ...
## price_predict
## Low Medium High
## 1251 3055 13
Now let’s take a look at the results. Our KNN model classified housing price correctly about 74% of the time, or 3,183 out of a possible 4,319 test cases.
Reading the confusion matrix below (predicted classes in rows, actual classes in columns), it misclassified 421 medium-priced houses as “low”, 636 low-priced houses as “medium”, 73 high-priced houses as “medium”, and 6 medium-priced houses as “high”.
KNN is a useful algorithm for classifying data points. We showed that at 74% accuracy, our algorithm successfully predicted housing price categories based on variables such as “bedrooms”, “bathrooms”, “sqft_living”, “sqft_lot”, “floors”, “sqft_above”, “sqft_basement”, “sqft_living15”, and “sqft_lot15”. If we were in the real estate market and wanted to know generally how high or low we should price a house, we could determine an answer based on these variables.
It should be noted, however, that the classification form of KNN used here cannot predict numeric prices, since the response variable must be categorical. To predict specific prices, one must turn to methods such as linear regression or PCR.
##
## price_predict Low Medium High
## Low 830 421 0
## Medium 636 2346 73
## High 0 6 7
## [1] 830 2346 7
Overall, then, KNN achieves an accuracy of 0.737.
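That figure follows directly from the confusion matrix printed above (predicted classes in rows, actual classes in columns):

```r
# Confusion matrix from the output above: rows = predicted, columns = actual
conf <- matrix(c(830,  421,  0,
                 636, 2346, 73,
                   0,    6,  7),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = c("Low", "Medium", "High"),
                               actual    = c("Low", "Medium", "High")))

accuracy <- sum(diag(conf)) / sum(conf)  # correct cases / all test cases
round(accuracy, 3)                       # 0.737
```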
4 Tree Modeling
We also applied a decision tree modeling approach to this regression problem, both to predict the average value of price and to gain interesting visualizations of the dataset.
4.1 Regression Tree
Using the ‘tree’ package, we built a tree with all the remaining features as predictors for price (i.e., excluding id, date, the various geographic variables, and sqft_living15 and sqft_lot15). We used the logarithm of price to provide an easier visualization of the tree and to allow more precise average values per node. This tree has 9 terminal nodes and a mean squared error of 0.12.
From the plot of the tree, we can see that the algorithm splits the data on grade, yr_built and sqft_living. As a general trend, and as expected, higher grades and larger sqft_living lead to higher prices, while lower grades and smaller houses lead to lower prices. The splits also reveal an interesting fact: the age of a house correlates positively with its price; in the splits based on yr_built, the older the house, the higher the average price. Because the tree predicts the log of price, back-transforming shows that the highest average price, in the farthest-right leaf, is 1,289,802.93 dollars, and the lowest, in the farthest-left leaf, is 273,758.06 dollars.
Furthermore, we use the ‘rpart’ package to build fancier visualization of decision trees for this dataset.
These trees give a result fairly similar to the classic tree, but here we can also observe the proportion of sample observations in each terminal leaf (in this case only 7). Again, the variables used to build the tree are grade, sqft_living and yr_built. As expected, the right branches of the tree contain only 20% of the houses, the most expensive ones with better grades than the rest; indeed, houses with the better grades and sqft_living greater than 3,757 make up only 5% of the data. The largest proportions fall in the left branches, given their lower grades: 80% of houses have a grade between 3 and 8, and 52% a grade of 3 to 7. The terminal node with the highest proportion contains houses of grade 7 built after 1953, namely 29% of the observations in the dataset.
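A sketch of the rpart fit, using a hypothetical toy frame in place of the full dataset (in the report, the rpart.plot package supplies the fancier drawings):

```r
library(rpart)  # ships with standard R installations

# Hypothetical stand-in with the same column names as the report's data
set.seed(1)
toy <- data.frame(grade       = sample(3:13, 200, replace = TRUE),
                  sqft_living = runif(200, 500, 6000),
                  yr_built    = sample(1900:2015, 200, replace = TRUE))
toy$price <- exp(11 + 0.2 * toy$grade + 2e-4 * toy$sqft_living +
                 rnorm(200, sd = 0.2))

# Regression tree on log(price), as in the report
fit <- rpart(log(price) ~ grade + sqft_living + yr_built, data = toy)
# rpart.plot::rpart.plot(fit)  # the fancier visualization shown in the report
fit$frame$var[1]               # variable chosen for the root split
```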
On another note, we experimented with a tree that provides an average price per geographic area. Using longitude and latitude as predictors for the log of price, we built the following tree, which shows the mean value of price for each area on a map, divided by splits on longitude and latitude.
This visualization shows that higher prices correspond to the central area, i.e., downtown Seattle and adjacent zones, where the average price is estimated at 1,088,161.36, while areas on the outskirts of the region show average values of 327,747.90 or 296,558.57, as one would normally expect.
4.2 Pruned Tree
Subsequently, we pruned the tree down with the standard ‘prune’ function.
First, we can observe a plot of the sizes of a sequence of different pruned trees versus their error rates. In our case, the vector of error rates for the pruning sequence was 2593.61, 2668.388, 2744.88, 2875.801, 3268.008, 3489.958, 4016.003, 5984.489.
Although the smallest optimal tree among these pruned trees appeared to be of size 9, the plot shows a pruned tree with 6 terminal nodes, with splits again on grade, sqft_living and yr_built. This tree is simpler to read given the fewer splits, but it adds no predictive power while actually taking away some important splits from a tree model that already simplifies the dataset significantly. We therefore decided it was better to use a non-pruned tree for testing our model.
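In rpart, the same pruning idea is driven by the complexity parameter; a minimal sketch on a hypothetical toy fit:

```r
library(rpart)

# Hypothetical toy data: y depends on x, z is noise
set.seed(1)
toy <- data.frame(x = runif(300), z = runif(300))
toy$y <- 2 * toy$x + rnorm(300, sd = 0.1)

fit <- rpart(y ~ x + z, data = toy)

# Pick the cp value with the smallest cross-validated error, then prune
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
nrow(pruned$frame) <= nrow(fit$frame)   # pruning never adds nodes
```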
4.3 Testing model
First, to test the model’s performance, we divided the dataset into training and test sets, where the latter is simply the first fold of the data.
##
## Regression tree:
## rpart(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
## sqft_lot + floors + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated, data = train.set)
##
## Variables actually used in tree construction:
## [1] grade sqft_living yr_built
##
## Root node error: 5390/19418 = 0.3
##
## n= 19418
##
## CP nsplit rel error xerror xstd
## 1 0.33 0 1.0 1.0 0.012
## 2 0.09 1 0.7 0.7 0.007
## 3 0.04 2 0.6 0.6 0.007
## 4 0.03 3 0.5 0.5 0.006
## 5 0.02 5 0.5 0.5 0.005
## 6 0.01 6 0.5 0.5 0.005
## 7 0.01 7 0.4 0.5 0.005
## 8 0.01 8 0.4 0.4 0.005
## [1] 0.01
## Call:
## rpart(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
## sqft_lot + floors + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated, data = train.set)
## n= 19418
##
## CP nsplit rel error xerror xstd
## 1 0.3284 0 1.000 1.000 0.01173
## 2 0.0887 1 0.672 0.672 0.00737
## 3 0.0377 2 0.583 0.583 0.00655
## 4 0.0324 3 0.545 0.548 0.00586
## 5 0.0217 5 0.480 0.497 0.00545
## 6 0.0200 6 0.459 0.472 0.00527
##
## Variable importance
## grade sqft_living sqft_above bathrooms yr_built
## 43 18 18 9 7
## floors sqft_basement bedrooms
## 2 2 1
##
## Node number 1: 19418 observations, complexity param=0.328
## mean=13, MSE=0.278
## left son=2 (15556 obs) right son=3 (3862 obs)
## Primary splits:
## grade splits as LLLLLLRRRRR, improve=0.328, (0 missing)
## sqft_living < 2440 to the left, improve=0.317, (0 missing)
## sqft_above < 2000 to the left, improve=0.233, (0 missing)
## bathrooms < 2.62 to the left, improve=0.181, (0 missing)
## bedrooms < 3.5 to the left, improve=0.112, (0 missing)
## Surrogate splits:
## sqft_above < 2500 to the left, agree=0.883, adj=0.413, (0 split)
## sqft_living < 2920 to the left, agree=0.880, adj=0.395, (0 split)
## bathrooms < 3.12 to the left, agree=0.839, adj=0.191, (0 split)
## sqft_basement < 1520 to the left, agree=0.808, adj=0.035, (0 split)
## yr_built < 2010 to the left, agree=0.802, adj=0.004, (0 split)
##
## Node number 2: 15556 observations, complexity param=0.0887
## mean=12.9, MSE=0.18
## left son=4 (10110 obs) right son=5 (5446 obs)
## Primary splits:
## grade splits as LLLLLR-----, improve=0.1710, (0 missing)
## sqft_living < 2000 to the left, improve=0.1660, (0 missing)
## bathrooms < 1.62 to the left, improve=0.0974, (0 missing)
## sqft_above < 1410 to the left, improve=0.0972, (0 missing)
## sqft_basement < 75 to the left, improve=0.0688, (0 missing)
## Surrogate splits:
## bathrooms < 2.12 to the left, agree=0.735, adj=0.244, (0 split)
## sqft_above < 1770 to the left, agree=0.735, adj=0.244, (0 split)
## sqft_living < 2140 to the left, agree=0.735, adj=0.244, (0 split)
## floors < 1.75 to the left, agree=0.725, adj=0.214, (0 split)
## yr_built < 1990 to the left, agree=0.707, adj=0.162, (0 split)
##
## Node number 3: 3862 observations, complexity param=0.0377
## mean=13.7, MSE=0.212
## left son=6 (2950 obs) right son=7 (912 obs)
## Primary splits:
## sqft_living < 3760 to the left, improve=0.248, (0 missing)
## grade splits as ------LRRRR, improve=0.227, (0 missing)
## bathrooms < 3.12 to the left, improve=0.190, (0 missing)
## sqft_above < 3840 to the left, improve=0.151, (0 missing)
## sqft_basement < 558 to the left, improve=0.120, (0 missing)
## Surrogate splits:
## sqft_above < 3760 to the left, agree=0.900, adj=0.576, (0 split)
## grade splits as ------LLRRR, agree=0.836, adj=0.304, (0 split)
## bathrooms < 3.62 to the left, agree=0.827, adj=0.266, (0 split)
## sqft_basement < 1220 to the left, agree=0.803, adj=0.166, (0 split)
## bedrooms < 5.5 to the left, agree=0.774, adj=0.042, (0 split)
##
## Node number 4: 10110 observations, complexity param=0.0324
## mean=12.8, MSE=0.158
## left son=8 (2066 obs) right son=9 (8044 obs)
## Primary splits:
## grade splits as LLLLR------, improve=0.1050, (0 missing)
## sqft_living < 1500 to the left, improve=0.1020, (0 missing)
## sqft_basement < 30 to the left, improve=0.0818, (0 missing)
## yr_built < 1930 to the right, improve=0.0616, (0 missing)
## bathrooms < 1.62 to the left, improve=0.0572, (0 missing)
## Surrogate splits:
## sqft_living < 935 to the left, agree=0.836, adj=0.196, (0 split)
## sqft_above < 815 to the left, agree=0.826, adj=0.150, (0 split)
## bedrooms < 1.5 to the left, agree=0.802, adj=0.033, (0 split)
## bathrooms < 0.875 to the left, agree=0.799, adj=0.017, (0 split)
## condition splits as LLRRR, agree=0.798, adj=0.009, (0 split)
##
## Node number 5: 5446 observations, complexity param=0.0217
## mean=13.1, MSE=0.134
## left son=10 (4195 obs) right son=11 (1251 obs)
## Primary splits:
## yr_built < 1960 to the right, improve=0.1610, (0 missing)
## sqft_living < 2440 to the left, improve=0.0900, (0 missing)
## sqft_basement < 465 to the left, improve=0.0689, (0 missing)
## condition splits as RLLLR, improve=0.0491, (0 missing)
## yr_renovated < 978 to the left, improve=0.0409, (0 missing)
## Surrogate splits:
## yr_renovated < 978 to the left, agree=0.798, adj=0.120, (0 split)
## bathrooms < 1.62 to the right, agree=0.790, adj=0.088, (0 split)
## condition splits as RLLLR, agree=0.787, adj=0.074, (0 split)
## sqft_living < 4380 to the left, agree=0.771, adj=0.003, (0 split)
## sqft_lot < 440000 to the left, agree=0.771, adj=0.002, (0 split)
##
## Node number 6: 2950 observations
## mean=13.5, MSE=0.142
##
## Node number 7: 912 observations
## mean=14.1, MSE=0.218
##
## Node number 8: 2066 observations
## mean=12.5, MSE=0.156
##
## Node number 9: 8044 observations, complexity param=0.0324
## mean=12.8, MSE=0.137
## left son=18 (5610 obs) right son=19 (2434 obs)
## Primary splits:
## yr_built < 1950 to the right, improve=0.1650, (0 missing)
## sqft_living < 2000 to the left, improve=0.0766, (0 missing)
## sqft_lot < 6520 to the right, improve=0.0662, (0 missing)
## sqft_basement < 30 to the left, improve=0.0604, (0 missing)
## condition splits as LLLLR, improve=0.0240, (0 missing)
## Surrogate splits:
## bedrooms < 2.5 to the right, agree=0.736, adj=0.128, (0 split)
## sqft_above < 955 to the right, agree=0.725, adj=0.092, (0 split)
## yr_renovated < 970 to the left, agree=0.718, adj=0.069, (0 split)
## bathrooms < 1.12 to the right, agree=0.717, adj=0.063, (0 split)
## sqft_living < 955 to the right, agree=0.712, adj=0.048, (0 split)
##
## Node number 10: 4195 observations
## mean=13.1, MSE=0.105
##
## Node number 11: 1251 observations
## mean=13.4, MSE=0.136
##
## Node number 18: 5610 observations
## mean=12.7, MSE=0.11
##
## Node number 19: 2434 observations
## mean=13.1, MSE=0.125
## NULL
## [1] Inf
Building the trees on the training dataset, we obtain a slightly different tree pruned to 5 leaves, which splits the data on grade first and then on sqft_living. The plot of errors versus size also points to an optimal tree at 34 nodes.
## n= 19418
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 19418 5390 13.0
## 2) grade=3,4,5,6,7,8 15556 2800 12.9
## 4) grade=3,4,5,6,7 10110 1590 12.8
## 8) grade=3,4,5,6 2066 323 12.5 *
## 9) grade=7 8044 1100 12.8
## 18) yr_built>=1.95e+03 5610 618 12.7 *
## 19) yr_built< 1.95e+03 2434 305 13.1 *
## 5) grade=8 5446 727 13.1
## 10) yr_built>=1.96e+03 4195 440 13.1 *
## 11) yr_built< 1.96e+03 1251 171 13.4 *
## 3) grade=9,10,11,12,13 3862 820 13.7
## 6) sqft_living< 3.76e+03 2950 418 13.5 *
## 7) sqft_living>=3.76e+03 912 199 14.1 *
Evaluating this tree on the test data, we can see that the trained model did a good job of predicting price, as the errors and the tree structure are almost identical to those on the training data.
## [1] 413761889192
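The evaluation amounts to predicting on the held-out fold and comparing against actual prices on the dollar scale; the sketch below uses hypothetical toy data, and the choice of mean squared error is an assumption (the report's exact error metric is not shown):

```r
library(rpart)

# Hypothetical toy data standing in for the housing dataset
set.seed(1)
toy <- data.frame(x = runif(300))
toy$price <- exp(12 + toy$x + rnorm(300, sd = 0.1))
train.set <- toy[1:240, ]
test.set  <- toy[241:300, ]

fit  <- rpart(log(price) ~ x, data = train.set)
pred <- exp(predict(fit, newdata = test.set))  # back to the dollar scale
mse  <- mean((pred - test.set$price)^2)        # squared error in dollars^2
```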
4.4 Random Forest
Finally, we used a random forest algorithm to evaluate whether this type of ensembling could increase the performance of the tree model on our dataset.
##
## Call:
## randomForest(formula = log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + condition + grade + sqft_above + sqft_basement + yr_built + yr_renovated, data = train.set, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.0871
## % Var explained: 68.6
Thus, we ran a regression random forest on all the predictor variables with the log of price as the target. The model builds an ensemble of 500 trees with 3 variables tried at each split, and it ultimately explains 68.6% of the variance with an MSE of 0.0871.
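A sketch of the call, with hypothetical toy data standing in for train.set (randomForest is a CRAN package):

```r
library(randomForest)  # install.packages("randomForest") if needed

# Hypothetical stand-in for train.set with a few of the report's columns
set.seed(1)
toy <- data.frame(grade       = sample(3:13, 300, replace = TRUE),
                  sqft_living = runif(300, 500, 6000),
                  yr_built    = sample(1900:2015, 300, replace = TRUE))
toy$price <- exp(11 + 0.2 * toy$grade + 2e-4 * toy$sqft_living +
                 rnorm(300, sd = 0.2))

rf <- randomForest(log(price) ~ grade + sqft_living + yr_built,
                   data = toy, importance = TRUE)  # 500 trees by default
rf$mse[rf$ntree]   # out-of-bag MSE after the final tree
importance(rf)     # per-variable importance measures
```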
5 PCA
Here, we use principal component analysis (PCA) to help predict house price. We chose this method because there are many variables and we did not know which ones to use; we want to capture as much information as possible with the fewest number of variables.
5.1 Subset Data
We deleted some variables we presumed to be unhelpful or uncorrelated with house price: latitude, longitude, the neighborhood variables sqft_living15 and sqft_lot15, yr_renovated, zipcode, the date of record, and id. There are also several variables consisting mostly of 0 values, so we deleted those as well (these include “view” and “waterfront”). We then took a look at the dataset.
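The subsetting step can be sketched as follows; the one-row frame is a hypothetical stand-in that carries the real column names:

```r
# One-row hypothetical stand-in for the 21-column kc_house_data frame
house <- data.frame(id = 1, date = "20141013T000000", price = 221900,
                    bedrooms = 3, bathrooms = 1, sqft_living = 1180,
                    sqft_lot = 5650, floors = 1, waterfront = 0, view = 0,
                    condition = 3, grade = 7, sqft_above = 1180,
                    sqft_basement = 0, yr_built = 1955, yr_renovated = 0,
                    zipcode = 98178, lat = 47.5, long = -122.2,
                    sqft_living15 = 1340, sqft_lot15 = 5650)

# Drop the columns judged unhelpful for PCA
drop_cols <- c("id", "date", "waterfront", "view", "yr_renovated",
               "zipcode", "lat", "long", "sqft_living15", "sqft_lot15")
house2 <- house[, setdiff(names(house), drop_cols)]
str(house2)   # 11 variables remain: price plus 10 predictors
```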
## 'data.frame': 21613 obs. of 11 variables:
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
There are 11 variables left; one of them, price, will serve as the response variable, and the remaining 10 as predictors. All variables are numeric.
Next, we need to check if there are NA values in the dataset.
There are no NA values.
## bedrooms bathrooms sqft_living sqft_lot floors condition
## bedrooms 1.0000 0.5159 0.5767 0.03170 0.1754 0.02847
## bathrooms 0.5159 1.0000 0.7547 0.08774 0.5007 -0.12498
## sqft_living 0.5767 0.7547 1.0000 0.17283 0.3539 -0.05875
## sqft_lot 0.0317 0.0877 0.1728 1.00000 -0.0052 -0.00896
## floors 0.1754 0.5007 0.3539 -0.00520 1.0000 -0.26377
## condition 0.0285 -0.1250 -0.0588 -0.00896 -0.2638 1.00000
## grade 0.3570 0.6650 0.7627 0.11362 0.4582 -0.14467
## sqft_above 0.4776 0.6853 0.8766 0.18351 0.5239 -0.15821
## sqft_basement 0.3031 0.2838 0.4350 0.01529 -0.2457 0.17410
## yr_built 0.1542 0.5060 0.3180 0.05308 0.4893 -0.36142
## grade sqft_above sqft_basement yr_built
## bedrooms 0.357 0.4776 0.3031 0.1542
## bathrooms 0.665 0.6853 0.2838 0.5060
## sqft_living 0.763 0.8766 0.4350 0.3180
## sqft_lot 0.114 0.1835 0.0153 0.0531
## floors 0.458 0.5239 -0.2457 0.4893
## condition -0.145 -0.1582 0.1741 -0.3614
## grade 1.000 0.7559 0.1684 0.4470
## sqft_above 0.756 1.0000 -0.0519 0.4239
## sqft_basement 0.168 -0.0519 1.0000 -0.1331
## yr_built 0.447 0.4239 -0.1331 1.0000
Next, we examined the correlations among the 10 predictor variables and found that they are all related, some highly so. This makes it difficult to determine which ones are important, so we used PCA to reduce dimensionality.
5.2 PCA part
As the 10 variables have different scales, it is necessary to scale them before analysis. Performing PCA on un-normalized variables will heavily weight variables with high variances.
We used the prcomp function to perform PCA, and subsequently checked the means and standard deviations of the variables. Since the variables were centered and scaled inside prcomp, the means of the transformed variables are essentially zero.
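In code, the scaling and decomposition happen in one call; the matrix below is a hypothetical stand-in for the 10 predictor columns of house2, and `houseprice` matches the score-matrix name used later in the report:

```r
# Hypothetical stand-in for the 10 numeric predictor columns of house2
set.seed(1)
predictors <- matrix(rnorm(500), ncol = 10)
colnames(predictors) <- paste0("v", 1:10)

pca <- prcomp(predictors, center = TRUE, scale. = TRUE)  # scale before PCA
houseprice <- pca$x   # component scores, one column per PC
summary(pca)          # sdev and variance proportions, as in the output below
```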
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.071 1.322 1.007 0.9159 0.7926 0.7527 0.6683
## Proportion of Variance 0.429 0.175 0.101 0.0839 0.0628 0.0567 0.0447
## Cumulative Proportion 0.429 0.604 0.705 0.7890 0.8518 0.9085 0.9531
## PC8 PC9 PC10
## Standard deviation 0.5060 0.4611 0.0000000000000139
## Proportion of Variance 0.0256 0.0213 0.0000000000000000
## Cumulative Proportion 0.9787 1.0000 1.0000000000000000
## bedrooms bathrooms sqft_living
## 0.0000000000000002143 -0.0000000000000001689 0.0000000000000002410
## sqft_lot floors condition
## 0.0000000000000000132 -0.0000000000000000227 -0.0000000000000002160
## grade sqft_above sqft_basement
## 0.0000000000000002022 0.0000000000000001110 0.0000000000000000207
## yr_built
## 0.0000000000000019023
## bedrooms bathrooms sqft_living sqft_lot floors
## 0.930 0.770 918.441 41420.512 0.540
## condition grade sqft_above sqft_basement yr_built
## 0.651 1.175 828.091 442.575 29.373
We can see that with 7 components, 95% of the variance is explained.
Let’s take a look at how the variables form each component.
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## bedrooms 0.2873 -0.3055 0.1563 -0.0402 0.7428 -0.3876 0.0572
## bathrooms 0.4212 -0.0715 0.0984 0.0443 -0.1394 -0.2240 -0.1461
## sqft_living 0.4353 -0.2450 -0.0401 0.0120 -0.0186 0.2590 0.0456
## sqft_lot 0.0800 -0.0511 -0.9540 0.1245 0.0366 -0.1928 -0.1442
## floors 0.2961 0.3859 0.1091 -0.2784 -0.0676 -0.1257 -0.7552
## condition -0.1103 -0.4417 -0.0495 -0.7460 -0.3469 -0.3140 0.1046
## grade 0.4091 -0.0096 -0.0230 -0.0549 -0.2796 0.3578 0.1993
## sqft_above 0.4318 0.0462 -0.1214 -0.2472 0.1525 0.3205 0.2049
## sqft_basement 0.0954 -0.5949 0.1439 0.4875 -0.3240 -0.0621 -0.2887
## yr_built 0.2853 0.3726 0.0624 0.2121 -0.3096 -0.5886 0.4540
## PC8 PC9 PC10
## bedrooms 0.2925 0.0855 0.0000000000000014791
## bathrooms -0.6178 0.5773 0.0000000000000023896
## sqft_living -0.1601 -0.4058 -0.6992603665794965284
## sqft_lot 0.0600 0.0500 -0.0000000000000001189
## floors 0.2221 -0.1840 0.0000000000000002764
## condition 0.0313 -0.0540 0.0000000000000001844
## grade 0.6000 0.4723 0.0000000000000002236
## sqft_above -0.2521 -0.3265 0.6304719252550327058
## sqft_basement 0.1395 -0.2312 0.3369571058700506772
## yr_built 0.1010 -0.2691 0.0000000000000000411
Looking at the loadings, sqft_living and sqft_above contribute the most to the first component (loadings of about 0.44 and 0.43), along with bathrooms and grade; the second component is dominated by sqft_basement, with a loading of about -0.59.
We can visualize the variance explained by each component. We can see that the first component explains the most, while the subsequent ones explain less.
We can also visualize how each component is formed by the different variables, although the resulting graph is much harder to read than the rotation matrix above.
Visualizing the cumulative proportions of variance, we can see that after 7 components, the curve becomes smooth.
5.3 Predicting using PCA
Finally, we created a PCR model with the 7 components. A PCR model is simply a linear model on the component scores, so we used the lm function to create it.
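Since the component scores are ordinary numeric columns, the PCR fit is just lm() on the first seven of them. The stand-in scores below are hypothetical; `houseprice` matches the score-matrix name in the report's output:

```r
# Hypothetical stand-in: PCA scores of a random 50 x 10 matrix
set.seed(1)
X <- matrix(rnorm(500), ncol = 10)
houseprice <- prcomp(X, scale. = TRUE)$x   # component scores
price <- 5e5 + 1e5 * houseprice[, 1] + rnorm(50, sd = 1e4)

pcr.model <- lm(price ~ houseprice[, 1:7])  # regress price on 7 scores
summary(pcr.model)$r.squared                # variance explained
```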
##
## Call:
## lm(formula = house2$price ~ houseprice[, 1:7])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1353938 -117818 -11744 91786 4213791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 540088 1554 347.62 < 0.0000000000000002 ***
## houseprice[, 1:7]PC1 109184 750 145.53 < 0.0000000000000002 ***
## houseprice[, 1:7]PC2 -78959 1175 -67.21 < 0.0000000000000002 ***
## houseprice[, 1:7]PC3 -9625 1543 -6.24 0.00000000046 ***
## houseprice[, 1:7]PC4 -37261 1696 -21.97 < 0.0000000000000002 ***
## houseprice[, 1:7]PC5 -58357 1960 -29.77 < 0.0000000000000002 ***
## houseprice[, 1:7]PC6 171526 2064 83.10 < 0.0000000000000002 ***
## houseprice[, 1:7]PC7 -34516 2325 -14.85 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 228000 on 21605 degrees of freedom
## Multiple R-squared: 0.613, Adjusted R-squared: 0.613
## F-statistic: 4.89e+03 on 7 and 21605 DF, p-value: <0.0000000000000002
We can see that 61.3% of the variance is explained by the components. We do not use accuracy to evaluate the model; because house prices span such a large range, an accuracy measure would not make sense for this regression problem.
We want to see how good the PCR model is, so we created a full linear model to compare it to.
##
## Call:
## lm(formula = price ~ . - sqft_basement, data = house2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1330687 -116072 -12249 88891 4428176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6899404.9190 135999.6452 50.73 < 0.0000000000000002 ***
## bedrooms -49150.7573 2112.1474 -23.27 < 0.0000000000000002 ***
## bathrooms 49461.3702 3632.5990 13.62 < 0.0000000000000002 ***
## sqft_living 203.6879 4.7239 43.12 < 0.0000000000000002 ***
## sqft_lot -0.2216 0.0384 -5.77 0.00000000780726 ***
## floors 28958.5249 3915.4655 7.40 0.00000000000015 ***
## condition 18383.6066 2586.6454 7.11 0.00000000000122 ***
## grade 131536.3956 2251.9728 58.41 < 0.0000000000000002 ***
## sqft_above -21.9900 4.6116 -4.77 0.00000186852828 ***
## yr_built -3953.4750 69.7857 -56.65 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 227000 on 21603 degrees of freedom
## Multiple R-squared: 0.618, Adjusted R-squared: 0.618
## F-statistic: 3.88e+03 on 9 and 21603 DF, p-value: <0.0000000000000002
61.8% of the variance is explained by the independent variables in the full linear model, a bit better than the PCR model.
5.4 PCR vs. Full Linear Model: A Comparison
The R^2 of the full model is 0.618, slightly higher than the PCR model’s 0.613. We can also see that both models underestimate house prices above $4,000,000.
5.5 Limitations of PCA
There are a few price outliers. We did not delete them, however, because we think the high prices carry important information. PCA also requires numeric inputs, even though the grade and condition variables are really ordinal/categorical, and it works best on roughly normal data, whereas our dataset is heavily right-skewed.
6 Ridge and Lasso
6.1 The Ridge
For our dataset, the numbers of bedrooms, bathrooms, and floors are categorical, so we convert them into factors. We then prepare a log-scale grid of λ values, from 10^10 down to 10^-2 in 100 steps, and build the ridge model. Afterwards, we plot the coefficients to see the overall trend.
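The grid and the ridge fit can be sketched with glmnet (a CRAN package); the design matrix below is a hypothetical stand-in for the report's factor-coded predictors:

```r
library(glmnet)  # install.packages("glmnet") if needed

# Hypothetical stand-in design matrix and response
set.seed(1)
x <- matrix(rnorm(1000), ncol = 10)
y <- rowSums(x[, 1:3]) + rnorm(100)

grid <- 10^seq(10, -2, length = 100)                 # lambda from 1e10 to 1e-2
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 -> ridge
dim(coef(ridge.mod))    # one column of coefficients per lambda value
# plot(ridge.mod)       # coefficient paths against the penalty
```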
## [1] 64 100
The glmnet() function creates 100 models, one for each of our 100 \(\lambda\) values. Each model’s coefficients are stored in the object we named ridge.mod.
There are 55 coefficients for each model. The 100 \(\lambda\) values are chosen from 0.01 (\(10^{-2}\)) to \(10^{10}\), essentially covering the ordinary least squares model (\(\lambda = 0\)) and the null/constant model (\(\lambda\) approaching infinity).
Because ridge regression uses the “L2 norm” penalty, the coefficients are expected to be smaller when \(\lambda\) is large. At the “midpoint” of our sequence (the 50th of the 100 values), \(\lambda\) equals 11497.57 and the sum of squares of the coefficients is 0.002. At the 60th value (we have a decreasing sequence), \(\lambda\) = 705.48, the sum of squares of the coefficients is 0.04, about 16 times larger.
The model only records 100 distinct values of \(\lambda\), but we can use the predict() function for various purposes, such as interpolating the predicted coefficients at \(\lambda = 50\).
## (Intercept) bedrooms2 bedrooms3 bedrooms4 bedrooms5
## -0.0006179 -0.0076927 -0.0067785 0.0068163 0.0134016
## bedrooms6 bedrooms7 bedrooms8 bedrooms9 bedrooms10
## 0.0144712 0.0202850 0.0290019 0.0175400 0.0137793
## bedrooms11 bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5
## -0.0019261 -0.0123238 -0.0116420 0.0047366 -0.0070193
## bathrooms1.75 bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75
## -0.0049456 -0.0045507 -0.0004188 0.0004906 0.0062803
## bathrooms3 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## 0.0087716 0.0227680 0.0205608 0.0341827 0.0377144
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 0.0512497 0.0410413 0.0774969 0.0588491 0.0665617
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5
## 0.1033113 0.1011313 0.1264703 0.1325946 0.0590444
## bathrooms6.75 bathrooms7.5 bathrooms7.75 bathrooms8 sqft_living
## 0.1137092 -0.0062946 0.3369097 0.2325487 0.0132109
## sqft_lot floors1.5 floors2 floors2.5 floors3
## 0.0016311 0.0014869 0.0085804 0.0272498 0.0025484
## floors3.5 condition2 condition3 condition4 condition5
## 0.0195488 -0.0108018 -0.0000823 -0.0010613 0.0044430
## grade4 grade5 grade6 grade7 grade8
## -0.0166845 -0.0150763 -0.0133822 -0.0120699 -0.0000254
6.1.1 Train and Test sets
Let us split the data into training and test sets, so that we can estimate test errors. The split will be used here for Ridge regression, and later for Lasso regression.
The test set mean squared error (MSE) is 0.571. (Keep in mind that we are using standardized scores for \(\lambda = 4\).)
On the other hand, for the null model (\(\lambda\) approaching infinity), the MSE is found to be 0.978. So \(\lambda = 4\) cuts the test MSE to roughly half that of the null model, at the expense of some bias.
We could have also used a large \(\lambda\) value to find the MSE for the null model. These two methods yield essentially the same answer of 0.978.
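The two MSE computations can be sketched as follows, on hypothetical stand-in data (the report's actual predictors and scaling are assumed, not reproduced):

```r
library(glmnet)  # CRAN package

# Hypothetical stand-in data: 200 rows, 10 predictors
set.seed(1)
x <- matrix(rnorm(2000), ncol = 10)
y <- rowSums(x[, 1:3]) + rnorm(200)
train <- 1:160

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)

# Test MSE at lambda = 4
pred4 <- predict(ridge.mod, s = 4, newx = x[-train, ])
mse4  <- mean((pred4 - y[-train])^2)

# Null model (lambda -> infinity): predict the training mean everywhere
msenull <- mean((mean(y[train]) - y[-train])^2)
mse4 < msenull   # shrinkage beats the constant model here
```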
## [1] 0.34
## (Intercept) bedrooms2 bedrooms3 bedrooms4 bedrooms5
## 0.04818 0.09171 -0.04770 -0.14763 -0.12726
## bedrooms6 bedrooms7 bedrooms8 bedrooms9 bedrooms10
## -0.29521 -0.33546 -0.29151 -0.44392 -0.38316
## bedrooms11 bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5
## -0.97528 -0.02447 -0.04747 -0.41652 -0.05375
## bathrooms1.75 bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75
## -0.04096 -0.04077 0.01444 -0.03759 -0.00733
## bathrooms3 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## 0.07802 0.22207 0.13623 0.50483 0.18360
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 0.60370 0.62950 2.39920 1.00729 1.15719
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5
## 2.22582 2.05909 1.96062 3.91087 0.29896
## bathrooms6.75 bathrooms7.5 bathrooms7.75 bathrooms8 sqft_living
## 1.01433 0.00000 0.00000 0.00000 0.19787
## sqft_lot floors1.5 floors2 floors2.5 floors3
## -0.03167 0.04810 0.06720 0.17923 0.40364
## floors3.5 condition2 condition3 condition4 condition5
## 0.56829 -0.03787 0.00224 0.05391 0.20033
## grade4 grade5 grade6 grade7 grade8
## -0.68443 -0.70236 -0.55870 -0.29205 -0.01155
Now consider the other extreme special case of small \(\lambda\), which corresponds to the ordinary least squares (OLS) model. We can first use the ridge regression fit to predict the \(\lambda = 0\) case; the MSE found this way is 0.34.
We can also build the OLS model directly.
##
## Call:
## lm(formula = price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.441 -0.295 -0.033 0.229 11.815
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.14750 0.47833 -2.40 0.0165 *
## bedrooms2 0.02108 0.06851 0.31 0.7583
## bedrooms3 -0.12205 0.06861 -1.78 0.0753 .
## bedrooms4 -0.22602 0.06990 -3.23 0.0012 **
## bedrooms5 -0.20698 0.07313 -2.83 0.0047 **
## bedrooms6 -0.37573 0.08861 -4.24 0.0000224913062269 ***
## bedrooms7 -0.42131 0.15453 -2.73 0.0064 **
## bedrooms8 -0.37277 0.25578 -1.46 0.1450
## bedrooms9 -0.54117 0.35390 -1.53 0.1263
## bedrooms10 -0.46958 0.42841 -1.10 0.2731
## bedrooms11 -1.06635 0.60147 -1.77 0.0763 .
## bathrooms0.75 0.26012 0.43802 0.59 0.5526
## bathrooms1 0.25266 0.42303 0.60 0.5503
## bathrooms1.25 -0.14773 0.54481 -0.27 0.7863
## bathrooms1.5 0.24768 0.42382 0.58 0.5590
## bathrooms1.75 0.26148 0.42360 0.62 0.5371
## bathrooms2 0.26222 0.42375 0.62 0.5361
## bathrooms2.25 0.31744 0.42396 0.75 0.4540
## bathrooms2.5 0.26635 0.42394 0.63 0.5298
## bathrooms2.75 0.29587 0.42450 0.70 0.4858
## bathrooms3 0.38135 0.42486 0.90 0.3694
## bathrooms3.25 0.52316 0.42551 1.23 0.2189
## bathrooms3.5 0.43729 0.42540 1.03 0.3040
## bathrooms3.75 0.80645 0.42993 1.88 0.0607 .
## bathrooms4 0.48134 0.42968 1.12 0.2626
## bathrooms4.25 0.90334 0.43485 2.08 0.0378 *
## bathrooms4.5 0.93206 0.43407 2.15 0.0318 *
## bathrooms4.75 2.71209 0.47125 5.76 0.0000000088988964 ***
## bathrooms5 1.31154 0.46329 2.83 0.0046 **
## bathrooms5.25 1.45563 0.50302 2.89 0.0038 **
## bathrooms5.5 2.53299 0.49379 5.13 0.0000002952291571 ***
## bathrooms5.75 2.36784 0.55791 4.24 0.0000221296117797 ***
## bathrooms6 2.25868 0.52767 4.28 0.0000188134408722 ***
## bathrooms6.25 4.24288 0.73613 5.76 0.0000000084529191 ***
## bathrooms6.5 0.58420 0.73839 0.79 0.4289
## bathrooms6.75 1.30589 0.60838 2.15 0.0319 *
## sqft_living 0.46228 0.01692 27.32 < 0.0000000000000002 ***
## sqft_lot -0.03222 0.00578 -5.58 0.0000000248160754 ***
## floors1.5 0.04522 0.02279 1.98 0.0473 *
## floors2 0.06952 0.01921 3.62 0.0003 ***
## floors2.5 0.17378 0.06724 2.58 0.0098 **
## floors3 0.41135 0.03874 10.62 < 0.0000000000000002 ***
## floors3.5 0.57692 0.42213 1.37 0.1718
## condition2 0.19765 0.18269 1.08 0.2793
## condition3 0.23938 0.17339 1.38 0.1674
## condition4 0.29027 0.17346 1.67 0.0943 .
## condition5 0.43625 0.17433 2.50 0.0123 *
## grade5 0.01393 0.15287 0.09 0.9274
## grade6 0.16029 0.14467 1.11 0.2679
## grade7 0.43578 0.14477 3.01 0.0026 **
## grade8 0.72084 0.14563 4.95 0.0000007537853680 ***
## grade9 1.14077 0.14707 7.76 0.0000000000000095 ***
## grade10 1.63552 0.14935 10.95 < 0.0000000000000002 ***
## grade11 2.26268 0.15543 14.56 < 0.0000000000000002 ***
## grade12 3.39286 0.17457 19.44 < 0.0000000000000002 ***
## grade13 4.12985 0.30443 13.57 < 0.0000000000000002 ***
## sqft_above -0.08974 0.01531 -5.86 0.0000000046825032 ***
## sqft_basement NA NA NA NA
## yr_built -0.23916 0.00927 -25.80 < 0.0000000000000002 ***
## yr_renovated 0.04127 0.00624 6.61 0.0000000000393945 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.596 on 10739 degrees of freedom
## Multiple R-squared: 0.655, Adjusted R-squared: 0.653
## F-statistic: 351 on 58 and 10739 DF, p-value: <0.0000000000000002
The test MSE for the directly fitted OLS regression is 0.353, close to the 0.34 obtained from the ridge fit at \(\lambda = 0\).
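The agreement between the two routes is expected, since ridge with a zero penalty solves the same least-squares problem as OLS. A minimal sketch, again a Python/scikit-learn analogue on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge0 = Ridge(alpha=0.0).fit(X, y)  # zero penalty: no shrinkage at all

# The two coefficient vectors agree up to numerical tolerance.
print(np.allclose(ols.coef_, ridge0.coef_, atol=1e-6))  # True
```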
6.1.2 Use Cross-validation
We use the built-in cross-validation routine that comes with glmnet, which selects the best \(\lambda\) value.
The \(\lambda\) value minimizing the cross-validated MSE turns out to be 0.071 in this case.
## [1] 0.342
## (Intercept) bedrooms2 bedrooms3 bedrooms4 bedrooms5
## 0.04348 0.12109 0.00438 -0.07223 -0.06973
## bedrooms6 bedrooms7 bedrooms8 bedrooms9 bedrooms10
## -0.22814 -0.51888 0.01736 -0.45761 -0.43431
## bedrooms11 bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5
## -0.80846 -0.00521 -0.06374 0.23833 -0.05507
## bathrooms1.75 bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75
## -0.04347 -0.04695 0.00590 -0.04798 0.00637
## bathrooms3 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## 0.06353 0.27259 0.13974 0.48025 0.41569
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 0.69822 0.52011 1.42889 0.93785 1.24494
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5
## 1.63620 1.17970 2.68379 1.48659 -0.27442
## bathrooms6.75 bathrooms7.5 bathrooms7.75 bathrooms8 sqft_living
## 1.24617 -0.11642 9.60917 3.84458 0.19455
## sqft_lot floors1.5 floors2 floors2.5 floors3
## -0.02574 0.06864 0.06276 0.35401 0.32881
## floors3.5 condition2 condition3 condition4 condition5
## 0.50963 -0.11889 -0.05088 0.01750 0.14366
## grade4 grade5 grade6 grade7 grade8
## -0.56617 -0.58286 -0.49016 -0.27727 -0.01932
## grade9 grade10 grade11 grade12 grade13
## 0.36524 0.81265 1.42433 2.54973 4.30675
## sqft_above sqft_basement yr_built yr_renovated
## 0.14404 0.11795 -0.20027 0.04575
In the cross-validation plot, the first vertical dotted line marks the \(\lambda\) with the lowest cross-validated MSE, 0.342; the second vertical dotted line marks the largest \(\lambda\) within one standard error of that minimum. We then calculate the R squared value, which is 0.65 for the ridge model.
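Choosing \(\lambda\) by cross-validation can be sketched as below. This mirrors what cv.glmnet does, using scikit-learn's `RidgeCV` on synthetic data; the candidate grid is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=500)

# Candidate penalty values on a log-spaced grid, as cv.glmnet does internally.
alphas = np.logspace(-3, 3, 61)
cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

# Cross-validation picks the grid value with the lowest estimated test MSE.
print(cv.alpha_ in alphas)  # True
```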
6.2 The Lasso
The same function, glmnet(), with alpha set to 1 builds the lasso regression model. We again plot the fit across \(\lambda\) values to see the overall trend.
Here, we see that the lowest MSE occurs at \(\lambda = 0.369\), where the model retains about 47 non-zero coefficients.
## [1] 0.344
## (Intercept) bedrooms2 bedrooms3 bedrooms4 bedrooms5
## 0.01286 0.09360 0.00000 -0.04818 0.00000
## bedrooms6 bedrooms7 bedrooms8 bedrooms9 bedrooms10
## -0.07138 -0.11450 0.00000 0.00000 0.00000
## bedrooms11 bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5
## 0.00000 0.00000 0.00000 0.00000 0.00000
## bathrooms1.75 bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75
## 0.00000 0.00000 0.00000 -0.02710 0.00000
## bathrooms3 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## 0.00000 0.18498 0.03274 0.29635 0.18297
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 0.42824 0.23767 0.99029 0.46857 0.67974
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5
## 0.91052 0.06104 1.95610 0.06491 0.00000
## bathrooms6.75 bathrooms7.5 bathrooms7.75 bathrooms8 sqft_living
## 0.00000 0.00000 8.02174 2.24596 0.39307
## sqft_lot floors1.5 floors2 floors2.5 floors3
## -0.01899 0.00288 0.02334 0.22601 0.31298
## floors3.5 condition2 condition3 condition4 condition5
## 0.01047 -0.03239 -0.03053 0.00000 0.11863
## grade4 grade5 grade6 grade7 grade8
## -0.28065 -0.51219 -0.47415 -0.26078 0.00000
## grade9 grade10 grade11 grade12 grade13
## 0.37851 0.85741 1.52279 2.72151 4.68689
## sqft_above sqft_basement yr_built yr_renovated
## 0.00000 0.02922 -0.20745 0.03924
## (Intercept) bedrooms2 bedrooms4 bedrooms6 bedrooms7
## 0.01286 0.09360 -0.04818 -0.07138 -0.11450
## bathrooms2.5 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## -0.02710 0.18498 0.03274 0.29635 0.18297
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 0.42824 0.23767 0.99029 0.46857 0.67974
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms7.75
## 0.91052 0.06104 1.95610 0.06491 8.02174
## bathrooms8 sqft_living sqft_lot floors1.5 floors2
## 2.24596 0.39307 -0.01899 0.00288 0.02334
## floors2.5 floors3 floors3.5 condition2 condition3
## 0.22601 0.31298 0.01047 -0.03239 -0.03053
## condition5 grade4 grade5 grade6 grade7
## 0.11863 -0.28065 -0.51219 -0.47415 -0.26078
## grade9 grade10 grade11 grade12 grade13
## 0.37851 0.85741 1.52279 2.72151 4.68689
## sqft_basement yr_built yr_renovated
## 0.02922 -0.20745 0.03924
We then calculate the R squared of the lasso regression, which is 0.648.
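A brief sketch of how the lasso drives some coefficients exactly to zero, which is what makes it usable for variable selection below. This is a Python/scikit-learn analogue of the glmnet alpha = 1 fit, on synthetic data built with two truly irrelevant predictors.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
beta = np.array([2.0, 0.0, -1.5, 0.0, 1.0, 0.5])  # two zero coefficients
y = X @ beta + rng.normal(size=500)

# Lasso with the penalty chosen by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Indices of predictors whose coefficients survived the L1 penalty.
selected = np.flatnonzero(lasso.coef_)
print(selected)
```

The predictors with large true effects are always retained; whether the weakly relevant or irrelevant ones are zeroed out depends on the cross-validated penalty.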
Lasso regression is also a good tool for feature selection, since it drives some coefficients exactly to zero. So we build a linear model using the variables that lasso selected.
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + condition + grade + sqft_basement + yr_built + yr_renovated,
## data = kc_house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2297337 -107678 -12103 84167 4423367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5786132.6014 287169.2537 20.15 < 0.0000000000000002 ***
## bedrooms2 5000.7996 16364.1250 0.31 0.75992
## bedrooms3 -42266.0406 16372.6413 -2.58 0.00984 **
## bedrooms4 -77996.4955 16713.0612 -4.67 0.00000307769658706 ***
## bedrooms5 -79967.7532 17595.1577 -4.54 0.00000552672670617 ***
## bedrooms6 -145434.5049 21451.2933 -6.78 0.00000000001234956 ***
## bedrooms7 -261655.2281 39396.0402 -6.64 0.00000000003175618 ***
## bedrooms8 -55905.0419 62457.8929 -0.90 0.37075
## bedrooms9 -257155.1102 97573.2300 -2.64 0.00841 **
## bedrooms10 -228865.1782 125830.2877 -1.82 0.06895 .
## bedrooms11 -376317.8418 213834.1827 -1.76 0.07845 .
## bathrooms0.75 120775.3931 109765.0661 1.10 0.27121
## bathrooms1 96789.8822 106610.1309 0.91 0.36395
## bathrooms1.25 198630.3519 128075.2853 1.55 0.12094
## bathrooms1.5 100835.8044 106757.7965 0.94 0.34491
## bathrooms1.75 106680.6104 106699.1434 1.00 0.31741
## bathrooms2 106596.2395 106735.7985 1.00 0.31795
## bathrooms2.25 126046.4763 106764.4372 1.18 0.23777
## bathrooms2.5 107924.7623 106735.6188 1.01 0.31196
## bathrooms2.75 127022.2004 106878.5555 1.19 0.23466
## bathrooms3 148479.3603 106994.8134 1.39 0.16524
## bathrooms3.25 220325.3718 107135.6109 2.06 0.03975 *
## bathrooms3.5 171417.8775 107095.7088 1.60 0.10948
## bathrooms3.75 300179.5560 108209.0164 2.77 0.00554 **
## bathrooms4 270502.4083 108467.9347 2.49 0.01264 *
## bathrooms4.25 376695.2129 109667.6338 3.43 0.00059 ***
## bathrooms4.5 316003.6531 109134.8687 2.90 0.00379 **
## bathrooms4.75 656581.3498 116105.9686 5.66 0.00000001578048641 ***
## bathrooms5 474915.5948 116867.4050 4.06 0.00004846948630483 ***
## bathrooms5.25 593633.4268 122811.4895 4.83 0.00000134944579199 ***
## bathrooms5.5 727072.1630 127683.4780 5.69 0.00000001254684228 ***
## bathrooms5.75 545572.0030 153024.0622 3.57 0.00036 ***
## bathrooms6 1134991.7320 139869.6650 8.11 0.00000000000000051 ***
## bathrooms6.25 650879.2683 188349.5092 3.46 0.00055 ***
## bathrooms6.5 -17186.0546 185503.2678 -0.09 0.92619
## bathrooms6.75 567853.4806 186810.9333 3.04 0.00237 **
## bathrooms7.5 145092.2356 256965.4910 0.56 0.57233
## bathrooms7.75 3828379.1387 248399.0807 15.41 < 0.0000000000000002 ***
## bathrooms8 1570473.5799 191547.0917 8.20 0.00000000000000026 ***
## sqft_living 138.0847 3.8255 36.10 < 0.0000000000000002 ***
## sqft_lot -0.2596 0.0362 -7.16 0.00000000000081743 ***
## floors1.5 19188.7704 5780.9824 3.32 0.00090 ***
## floors2 27629.1245 4848.8142 5.70 0.00000001227173559 ***
## floors2.5 123157.5631 17500.7659 7.04 0.00000000000201930 ***
## floors3 136363.4875 9853.8560 13.84 < 0.0000000000000002 ***
## floors3.5 194254.8526 81197.8697 2.39 0.01675 *
## condition2 -19588.7798 42913.0952 -0.46 0.64805
## condition3 5853.5782 39923.4713 0.15 0.88343
## condition4 29077.6308 39932.7519 0.73 0.46652
## condition5 74263.2141 40168.0351 1.85 0.06450 .
## grade4 47977.1099 217441.7739 0.22 0.82537
## grade5 58346.9150 215109.2325 0.27 0.78621
## grade6 103493.8750 214945.4149 0.48 0.63017
## grade7 201590.6959 214977.4686 0.94 0.34839
## grade8 308137.7426 215017.3003 1.43 0.15185
## grade9 465726.1302 215080.0727 2.17 0.03037 *
## grade10 645649.1516 215182.1027 3.00 0.00270 **
## grade11 892622.0158 215462.4534 4.14 0.00003443713090282 ***
## grade12 1337778.9614 216619.3422 6.18 0.00000000067044514 ***
## grade13 2017629.0156 225570.4451 8.94 < 0.0000000000000002 ***
## sqft_basement 37.9302 4.6529 8.15 0.00000000000000038 ***
## yr_built -3014.3776 79.9195 -37.72 < 0.0000000000000002 ***
## yr_renovated 36.2602 3.8886 9.32 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213000 on 21533 degrees of freedom
## Multiple R-squared: 0.665, Adjusted R-squared: 0.664
## F-statistic: 689 on 62 and 21533 DF, p-value: <0.0000000000000002
## bedrooms2 bedrooms3 bedrooms4 bedrooms5 bedrooms6
## 14.22 31.67 28.90 10.13 2.73
## bedrooms7 bedrooms8 bedrooms9 bedrooms10 bedrooms11
## 1.30 1.12 1.26 1.05 1.01
## bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5 bathrooms1.75
## 18.81 793.56 3.26 339.08 657.43
## bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75 bathrooms3
## 441.80 466.05 1015.12 282.29 183.58
## bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4 bathrooms4.25
## 145.10 178.74 39.76 35.08 20.89
## bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25 bathrooms5.5
## 26.16 6.83 6.32 4.32 3.60
## bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5 bathrooms6.75
## 2.07 2.59 1.57 1.52 1.54
## bathrooms7.5 bathrooms7.75 bathrooms8 sqft_living sqft_lot
## 1.46 1.36 1.62 5.88 1.07
## floors1.5 floors2 floors2.5 floors3 floors3.5
## 1.28 2.64 1.08 1.27 1.02
## condition2 condition3 condition4 condition5 grade4
## 6.85 172.97 147.24 55.76 28.13
## grade5 grade6 grade7 grade8 grade9
## 244.31 1881.57 5348.36 4449.50 2345.99
## grade10 grade11 grade12 grade13 sqft_basement
## 1097.76 401.17 91.77 14.59 2.02
## yr_built yr_renovated
## 2.63 1.16
The condition p-values are all greater than 0.05. We consequently remove condition from the model and rebuild it.
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + grade + sqft_basement + yr_built + yr_renovated,
## data = kc_house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2299861 -107724 -11657 84378 4411029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6390398.7826 281343.4047 22.71 < 0.0000000000000002 ***
## bedrooms2 8793.1555 16411.3820 0.54 0.59210
## bedrooms3 -36077.6335 16412.8287 -2.20 0.02795 *
## bedrooms4 -71532.7566 16754.2911 -4.27 0.00001967336217461 ***
## bedrooms5 -74411.1623 17640.4167 -4.22 0.00002472445036141 ***
## bedrooms6 -142953.7615 21515.0023 -6.64 0.00000000003117787 ***
## bedrooms7 -258423.2081 39524.7387 -6.54 0.00000000006363165 ***
## bedrooms8 -55473.9576 62666.8420 -0.89 0.37605
## bedrooms9 -278523.9215 97883.1397 -2.85 0.00444 **
## bedrooms10 -227727.6995 126239.8568 -1.80 0.07126 .
## bedrooms11 -383449.7239 214555.7074 -1.79 0.07392 .
## bathrooms0.75 130401.2659 110127.9946 1.18 0.23639
## bathrooms1 101396.6106 106966.1914 0.95 0.34318
## bathrooms1.25 214387.1242 128495.1322 1.67 0.09524 .
## bathrooms1.5 109925.9055 107111.6944 1.03 0.30477
## bathrooms1.75 118404.4415 107051.3532 1.11 0.26872
## bathrooms2 119919.4240 107086.1173 1.12 0.26279
## bathrooms2.25 139025.1153 107115.6382 1.30 0.19434
## bathrooms2.5 120305.1066 107087.1231 1.12 0.26127
## bathrooms2.75 142766.2980 107227.2671 1.33 0.18306
## bathrooms3 161875.0630 107346.0986 1.51 0.13158
## bathrooms3.25 234468.4101 107486.7802 2.18 0.02917 *
## bathrooms3.5 183845.2189 107447.5300 1.71 0.08709 .
## bathrooms3.75 313888.2729 108564.3299 2.89 0.00384 **
## bathrooms4 285206.2049 108822.5754 2.62 0.00878 **
## bathrooms4.25 391992.6328 110026.5062 3.56 0.00037 ***
## bathrooms4.5 330613.4208 109491.8211 3.02 0.00253 **
## bathrooms4.75 670750.4580 116488.1711 5.76 0.00000000862231213 ***
## bathrooms5 486169.3041 117254.1868 4.15 0.00003392061401319 ***
## bathrooms5.25 608173.3283 123212.7876 4.94 0.00000080353413071 ***
## bathrooms5.5 740445.7587 128106.9263 5.78 0.00000000757747603 ***
## bathrooms5.75 555121.0099 153534.8034 3.62 0.00030 ***
## bathrooms6 1146755.7758 140335.5532 8.17 0.00000000000000032 ***
## bathrooms6.25 663880.3834 188978.6532 3.51 0.00044 ***
## bathrooms6.5 -4874.1796 186123.2253 -0.03 0.97911
## bathrooms6.75 567544.9154 187437.6867 3.03 0.00247 **
## bathrooms7.5 179930.9422 257815.0502 0.70 0.48524
## bathrooms7.75 3844440.4336 249230.9862 15.43 < 0.0000000000000002 ***
## bathrooms8 1580314.5517 192191.3916 8.22 < 0.0000000000000002 ***
## sqft_living 138.6162 3.8359 36.14 < 0.0000000000000002 ***
## sqft_lot -0.2583 0.0363 -7.11 0.00000000000119662 ***
## floors1.5 17634.6404 5788.3156 3.05 0.00232 **
## floors2 23824.3958 4833.3356 4.93 0.00000083187705496 ***
## floors2.5 120957.7694 17553.5690 6.89 0.00000000000570189 ***
## floors3 133102.5298 9863.2196 13.49 < 0.0000000000000002 ***
## floors3.5 195218.9047 81467.8489 2.40 0.01657 *
## grade4 -10520.8101 218092.4299 -0.05 0.96153
## grade5 10015.4782 215780.0375 0.05 0.96298
## grade6 55094.9696 215622.1274 0.26 0.79833
## grade7 152646.3785 215652.8471 0.71 0.47906
## grade8 258817.8723 215692.0607 1.20 0.23018
## grade9 416117.1073 215755.0687 1.93 0.05379 .
## grade10 594836.1060 215856.6898 2.76 0.00586 **
## grade11 840338.3056 216135.2163 3.89 0.00010 ***
## grade12 1285016.8472 217296.0725 5.91 0.00000000339622408 ***
## grade13 1959427.7048 226272.2209 8.66 < 0.0000000000000002 ***
## sqft_basement 40.3123 4.6630 8.65 < 0.0000000000000002 ***
## yr_built -3295.5874 76.1924 -43.25 < 0.0000000000000002 ***
## yr_renovated 28.4576 3.8364 7.42 0.00000000000012345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214000 on 21537 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 729 on 58 and 21537 DF, p-value: <0.0000000000000002
The R squared value is 0.662, which is better than what was achieved with the ridge (0.65) and lasso (0.648) regressions.
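The side-by-side comparison this section draws can be sketched in a few lines. This is a scikit-learn analogue on synthetic data, so the numbers will not match the report's 0.65 / 0.648 / 0.662; the penalty values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

# Fit each model on the same training set and score R squared on the same
# held-out test set, so the comparison is apples to apples.
results = {}
for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.01))]:
    results[name] = model.fit(X_tr, y_tr).score(X_te, y_te)  # .score is R^2
    print(name, round(results[name], 3))
```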
7 Conclusion
In the end, each of our models comes with its advantages and disadvantages. Although KNN proved 74% accurate at classifying prices into “low”, “medium” and “high” categories, those categories ultimately do not tell us much, considering their large ranges. Decision trees provide simple visualizations and tell us which features are the most important, but they oversimplify the dataset and explain relatively little variance (even with the random forest included). PCA and PCR likewise yield relatively low amounts of variance explained (around 60%), and they do not differ much from the variance explained by a full linear model. The ridge and lasso regressions also do not perform quite as well as the full linear model (they have lower R squared values). To summarize our findings: linear regression tends to offer the most explanatory power, and “sqft_living” and “grade” seem to influence price the most. This makes intuitive sense. Living space and quality of construction are the most important drivers of housing price, and the largely linear nature of this regression problem makes a simple linear model both efficient and strong.
8 Bibliography
Dataset available: https://www.kaggle.com/harlfoxem/housesalesprediction/data
Perry, M. J. (2016, June 5). New US homes today are 1,000 square feet larger than in 1973 and living space per person has nearly doubled. Retrieved from https://www.aei.org/carpe-diem/new-us-homes-today-are-1000-square-feet-larger-than-in-1973-and-living-space-per-person-has-nearly-doubled/
Roberts, D. (n.d.). Variance and Standard Deviation. Retrieved from https://mathbitsnotebook.com/Algebra1/StatisticsData/STSD.html
Rosenberg, M. (2018, July 31). Seattle-area home prices this spring rose at fastest rate since 2006 bubble. The Seattle Times. Retrieved from https://www.seattletimes.com/business/real-estate/seattle-area-home-prices-this-spring-rose-at-fastest-rate-since-2006-bubble/